Quick analysis and similarity network building¶
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd, numpy as np
import seaborn as sns
sns.set(style="whitegrid")
import sqlite3
Data¶
The data used here were downloaded from the Million Song Dataset:
- SQLite database containing most metadata about each track (NEW VERSION 03/27/2011).
- SQLite database linking artist ID to the tags (Echo Nest and musicbrainz ones).
Pandas can read data directly from the SQLite databases.
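As a minimal sketch of this pattern (with assumed toy data, since the actual Million Song Dataset files are not bundled here), we can build an in-memory SQLite database and read a table straight into a DataFrame with pd.read_sql:

```python
import sqlite3
import pandas as pd

# Toy in-memory database mimicking the artist_mbtag table's schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artist_mbtag (artist_id TEXT, mbtag TEXT)")
conn.executemany("INSERT INTO artist_mbtag VALUES (?, ?)",
                 [("AR1", "rock"), ("AR1", "uk"), ("AR2", "jazz")])
conn.commit()

# pd.read_sql runs the query on the connection and returns a DataFrame.
df = pd.read_sql("SELECT * FROM artist_mbtag", conn)
print(df)
```

The same call works unchanged on a file-backed connection such as the ones below.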
Connect to both db¶
conn_terms = sqlite3.connect("data/artist_term.db")
conn_artists = sqlite3.connect("data/track_metadata.db")
Get the mbtags¶
These are human-selected tags from musicbrainz.org. They cover a large variety of categories: countries, genres, languages.
mbtags = pd.read_sql("SELECT * FROM artist_mbtag",conn_terms)
cnt_art_tag = pd.pivot_table(mbtags,values="mbtag",index="artist_id",\
    aggfunc="count").reset_index().rename(columns={"mbtag":"countTags"})
print("{:,}".format(len(mbtags["artist_id"].unique())), "unique artists, and", \
      "{:,}".format(len(mbtags["mbtag"].unique())), "unique tags;", \
      "with an average of {0:0.2f}".format(cnt_art_tag["countTags"].mean()), "tags per artist")
cnt_art_tag.hist(bins=21)
The distribution of tag counts shows that most artists have only a few tags. It will be hard to build a similarity network from so little information!
Build a network¶
Source, target and edges¶
The data we use is a list of tuples (artist_id, mbtag) for 8,838 unique artists, with 1 to 20 tags each (average: 2.8).
We want to create an undirected network of artists who share common tags:
- Sources and targets are the artists (we copy the same dataframe and rename the id columns).
- Edges are created by merging the source on the target, using the tags as the common field. We then remove the edges where the source and target are the same.
source = mbtags.copy().rename(columns={"artist_id":"Source"})
target = mbtags.copy().rename(columns={"artist_id":"Target"})
edges = source.merge(target,on="mbtag",how="inner")
edges = edges[edges["Source"] != edges["Target"]]
print("There are", "{:,}".format(len(source)), \
"sources/targets connected by","{:,}".format(len(edges)), "links")
mbtags["nb_tags"] = mbtags.groupby(["artist_id"])["mbtag"].transform("count")
len(mbtags[mbtags["nb_tags"]>4]), len(mbtags[mbtags["nb_tags"]>2]), len(mbtags)
Remove redundant edges¶
This network has 8 million edges, many of them redundant, so we'll add a weight to each edge and drop the duplicates (keeping only one link per pair).
edges["weight"] = edges.groupby(["Source","Target"])["mbtag"].transform("count")
edges = edges[["Source","Target","weight"]].drop_duplicates()
len(edges)
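The weighting step can be checked on assumed toy data: transform("count") writes the group size onto every row of a (Source, Target) pair, so after dropping duplicates each pair keeps one row whose weight is the number of tags it shared.

```python
import pandas as pd

# Toy edge list (assumed data): A-B appears twice (two shared tags), A-C once.
edges_toy = pd.DataFrame({
    "Source": ["A", "A", "A"],
    "Target": ["B", "B", "C"],
    "mbtag":  ["rock", "uk", "rock"],
})

# Count the shared tags per (Source, Target) pair, then keep one row per pair.
edges_toy["weight"] = edges_toy.groupby(["Source", "Target"])["mbtag"].transform("count")
edges_toy = edges_toy[["Source", "Target", "weight"]].drop_duplicates()
print(edges_toy)
```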
tracks = pd.read_sql("SELECT artist_id,artist_name FROM songs",conn_artists)
artists = tracks.drop_duplicates()
print("There were", "{:,}".format(len(tracks)),"tracks, from", "{:,}".format(len(artists)), "artists")
If we look at the unique names of artists and ids, we realise that there is a discrepancy:
print(len(artists.artist_id.unique()), "unique artists")
print(len(artists.artist_name.unique()), "unique artist names")
Argh, it seems that some artist_id entries have several names!
e.g. below for AROQS6O1187FB3D146
artists[artists["artist_id"] == "AROQS6O1187FB3D146" ]
Looking at them, some are mistakes, but most are simply variations of the name, so we'll drop the duplicate artist_id rows and keep only the first corresponding artist_name.
artists = artists.drop_duplicates("artist_id")
len(artists),len(artists.artist_id.unique()),len(artists.artist_name.unique())
Add Tags list¶
The tags are currently in one column, with one value per row. We want them in one row per artist, so we use a clever join with the pandas groupby.
mbtags["Tags"] = mbtags.groupby(['artist_id'])['mbtag'].transform(lambda x: ','.join(x))
artists = artists.merge(mbtags.drop_duplicates("artist_id"),how="left")
len(artists),artists.columns
artists[artists["artist_id"] == "AR00GVV11C8A415A54"]
Add source names¶
edges = edges.merge(artists.rename(columns={"artist_id":"Source"}),\
how="left").rename(columns={"Source":"Source_id","artist_name":"Source"})
len(edges)
edges.head()
Add target names¶
edges = edges.merge(artists,left_on="Target",right_on="artist_id",how="left")
edges = edges.rename(columns={"Target":"Target_id","artist_name":"Target"})[["Source","Target","Source_id",\
"Target_id","weight"]]
len(edges),edges.columns,len(edges[edges["weight"] > 3])
We will keep only the most connected edges (weight greater than 3), and export them to csv to use in Gephi.
edges[edges["weight"] > 3][["Source","Target","weight","Source_id","Target_id"]].to_csv("edges.csv",index=None)
Get the nodes from the edges¶
nodes = pd.DataFrame(edges[edges["weight"] > 3]["Source_id"].tolist() + \
edges[edges["weight"] > 3]["Target_id"].tolist(),\
columns=["artist_id"]).drop_duplicates()
nodes = nodes.merge(mbtags[["artist_id","Tags"]],how="left").drop_duplicates()
len(nodes),nodes.columns
nodes = nodes.merge(artists,how="left")[["artist_name","Tags"]].rename(columns={"artist_name":"id"}).drop_duplicates()
We export to csv to use them in Gephi.
nodes.to_csv("nodes.csv",index=None)